Convolutions for Spatial Interaction Modeling
Zhaoen Su, Chao Wang, David Bradley, Carlos Vallespi-Gonzalez
Carl Wellington, Nemanja Djuric
{suzhaoen, chao.wang, dbradley, cvallespi, cwellington, ndjuric}@aurora.tech
Abstract
In many different fields, interactions between objects play
a critical role in determining their behavior. Graph neural
networks (GNNs) have emerged as a powerful tool for mod-
eling interactions, although often at the cost of adding con-
siderable complexity and latency. In this paper, we consider
the problem of spatial interaction modeling in the context
of predicting the motion of actors around autonomous ve-
hicles, and investigate alternatives to GNNs. We revisit 2D
convolutions and show that they can demonstrate compa-
rable performance to graph networks in modeling spatial
interactions with lower latency, thus providing an effective
and efﬁcient alternative in time-critical systems. Moreover,
we propose a novel interaction loss to further improve the
interaction modeling of the considered methods.
1. INTRODUCTION
Interactions or relations between objects are critical for
understanding both the individual behaviors and collective
properties of many systems. Conceptually, these interac-
tions can be modeled with graph structures that comprise
a set of objects (nodes) and their relationships (edges).
By applying deep learning techniques, graph neural net-
works (GNNs) have demonstrated great expressive power
in modeling interactions in various ﬁelds, including physi-
cal science [3,13,33,36], social science [18,25], knowledge
graphs [17], and other research areas [21,28,30,40].
Some interactions strongly depend on geometry, such as the Euclidean distance and relative directions between objects; in this work we refer to these as spatial interactions. One problem where spatial interaction is critical is motion forecasting, a key task in the fields of computer
vision, robotics in general and autonomous driving (AD) in
particular. Speciﬁcally, anticipating the future movements
of an object requires understanding not only its history, but
also the object’s interactions with other objects and its en-
vironment. These interactions strongly depend on relative
spatial features between objects, such as their relative loca-
tions, orientations, and velocities.
Graphs have achieved success in modeling spatial inter-
action [24,34,35,37]. Features of individual objects are typ-
ically encoded into attributes of graph nodes, and the graph
edges are built by passing node attributes and the relative
geometries of the node pair through a mapping function.
GNNs follow a message passing scheme, where each node
aggregates features of its neighboring nodes to compute its
new node attributes. These approaches have two character-
istics, as seen in the experimental section: (1) the relative
spatial features are not represented implicitly in the graph
and need to be handcrafted into the graph edge features; (2)
even a single iteration of GNN may be slower than convo-
lutional neural networks (CNNs), which makes GNNs less
suitable for applications in ﬁelds such as AD where fast in-
ference is safety-critical.
Alternatively, data structures for 2D or 3D convolutional
operations are presented in common grid forms, such as
through voxelization in 3D, rasterization in 2D bird’s-eye
view (BEV), or as intermediate CNN features. Importantly, spatial relations are intrinsically represented in these Euclidean grids. Thus, they theoretically allow spatial relations between objects to be learned by CNNs with sufficiently large receptive fields [12]. In other words, CNNs
have the potential to model spatial interactions. However,
even though deep CNN backbones with large receptive
ﬁelds are widely utilized in trajectory forecasting models,
research has shown that adding a GNN after a CNN back-
bone can still improve interaction modeling [5,34,35]. This
suggests the CNN backbones often do not fulﬁll their the-
oretical potential in modeling spatial interactions between
the trafﬁc actors.
In this work, we consider spatial interaction model-
ing through 2D convolutions and compare them to GNNs
within the context of motion forecasting for AD. A key de-
terminant of future motion for other drivers is the avoid-
ance of collisions, which represents a critical interaction
that we model explicitly. Collisions can be approximated as
geometric overlapping, which provides unambiguous deﬁ-
nitions for interaction metrics. We evaluate the methods on
large-scale real-world AD data to draw general conclusions.
Our contributions are summarized below:
arXiv:2104.07182v3  [cs.CV]  8 Jun 2022

• we identify three components to facilitate modeling
spatial interaction with convolutions: (1) large actor-
centric interaction region, (2) projecting feature maps
into the actor’s frame of reference, and (3) aggregation
of per-actor feature maps using convolutions;
• we perform empirical studies to compare interaction
modeling using convolutions and graphs, and ﬁnd that
(1) CNNs can perform similarly to or better than
GNNs; (2) adding the CNN can considerably improve
interaction modeling even when a GNN is used; (3)
adding a GNN demonstrates only minor additional
gain when the convolutional approach is already used;
• we propose and study a novel interaction loss.
2. RELATED WORK
2.1. Motion forecasting
There exists a signiﬁcant body of work on forecasting the
motion of trafﬁc actors. An input to the forecasting mod-
els can be a sequence of past actor states such as positions,
headings, or velocities [8, 9, 11, 14, 20, 34], or a sequence
of raw sensor data such as LiDAR or radar returns [6, 27]
where joint object detection and motion forecasting are per-
formed in an autonomous vehicle’s (AV) frame of reference.
While the latter approach may accelerate inference and joint
learning by sharing common CNN features among all ac-
tors, these single-stage models could beneﬁt from actor-
centric features. Two-stage models [5,10] address this issue
by using a ﬁrst stage to detect the actors and extract features,
and then adding a second stage in the frame of reference of
detected actors. The two stages are then learned jointly in
an end-to-end fashion. The interaction modeling study in
this paper adopts a two-stage architecture. Note that the designs used in the study, including rotated region of interest
(RROI) [29] and actor-centric design [5, 10, 11], have been
developed and applied in previous research in a context dif-
ferent from interaction modeling. However, our empirical
study demonstrates that utilizing these ideas allows convo-
lutions to effectively model spatial interaction as well.
2.2. Interaction modeling
GNNs have recently been applied to explicitly express
interactions in motion forecasting. NRI [24] models the
interaction between actors by using GNNs to infer inter-
actions while simultaneously learning dynamics. Vector-
Net [14] and CAR-Net [35] model actor-context interac-
tions.
Closely related to our work, SpaGNN [5] is also
a two-stage detection-and-forecasting model that builds a
graph for vehicles in the second stage to model vehicle-
vehicle interaction. The GNN models used for comparison
in the study of this paper follow the same design.
Beyond graph models, grid-based spatial relations have
been explored using social pooling approaches [1, 8, 16],
where pooling is used to capture the impact of surrounding
actors in the recurrent architecture. In social-LSTM [1,16],
the LSTM cell receives pooled spatial hidden states from
the LSTM cells of neighbors that are embedded into a grid.
Besides the parameter-free pooling, convolutional layers
have also been explored [8]. By contrast, our proposal is
fully convolutional. Moreover, these approaches pool the
spatial context of interacting actors while excluding the ac-
tor itself, thus the actor-context interaction is not directly
modeled in the process.
2.3. Interaction metrics
It is interesting to note that while various techniques
have been developed to model spatial interaction, most prior
work reports motion forecasting displacement errors. As
shown in this study, reducing displacement errors does not
necessarily indicate improvement in interaction modeling
for a motion forecasting task. An alternative metric that can
more explicitly indicate the level of interaction modeling
is to measure whether vehicle motion forecasts incorrectly
predict overlap with other vehicles [5, 34]. In this work,
we also propose vehicle-obstacle overlap rate within mo-
tion forecasts as another measure for interaction modeling.
3. METHODOLOGY
In this section we formulate the motion forecasting prob-
lem, followed by a discussion of two approaches to interac-
tion modeling: implicitly through 2D convolutions and ex-
plicitly through graphs. Fig. 1 illustrates the architectures of
the considered end-to-end models that jointly solve tasks of
object detection and motion forecasting, taking BEV repre-
sentation of the sensor data as an input and outputting both
object detections and their future trajectories. We empha-
size that we purposefully choose a commonly used input
representation, neural network design, and loss functions in
order to focus on understanding the interaction modeling
aspect of these approaches. Moreover, to quantify the anal-
ysis we limit our discussion to vehicle actors (see Appendix
for analysis of other actor types).
3.1. Problem formulation
Given input data comprising the past and current in-
formation of V interacting actors and the environment, a
model outputs their current and future states x represented as X_{0:H} = \{x^v_t : v = 1, \dots, V,\ t = T_0, \dots, T_H\}. As mentioned previously, our study considers raw sensor data as an
input to the model. Following the joint detection and fore-
casting architecture [6, 10], we encode the sensor data by
voxelizing and stacking a sequence of current and P past

Figure 1. Three model architectures in a scene illustrated with three vehicle actors and one obstacle (denoted by the white spot). All
models share the same ﬁrst-stage design shown from left to middle: input is a BEV raster image comprising past and current point clouds
and a semantic map in the AV frame. Through a CNN feature extractor we obtain a 4× downsampled feature map in the AV frame.
(a) Single-stage baseline: Object detection and trajectory forecasting are performed at a pixel-level. (b) Adding the proposed Interaction
Convolutional Module (ICM). For each actor we deﬁne an interaction region (IR) in the actor frame that is used to crop an area from the
feature map. Through the weight-sharing interactive CNN (ICNN) a feature vector is aggregated for each actor and then utilized to predict
a future trajectory in its frame. (c) Adding GNN into the architecture shown in (b).
LiDAR point clouds around the AV at time T0 in BEV representation, as well as a rasterized semantic map that provides an additional environmental prior; together these form the model input. The 2D detection at time T0 for each
actor is parameterized by a bounding box represented as
(cx, cy, cos θ, sin θ, w, l), denoting the x and y coordinates
of the actor’s centroid, the cosine and sine of its heading an-
gle, and the width and length of the box, respectively. As-
suming rigid SE2 transformations, future trajectories can be
represented as a sequence of tuples (cxt, cyt, cos θt, sin θt),
with t ∈ {T_1, . . . , T_H} [38].
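As a concrete illustration of the rigid SE2 assumption, a waypoint predicted in the actor frame can be mapped into the global frame by one rotation and one translation. The following is a minimal numpy sketch; the helper name and tuple layout are our own illustration, not from the paper:

```python
import numpy as np

def actor_to_global(waypoints, actor_pose):
    """Map waypoints (cx, cy, cos_t, sin_t) predicted in the actor frame
    into the global frame via a rigid SE2 transform (hypothetical helper).
    actor_pose: (x, y, cos_h, sin_h) of the detected actor at T0."""
    x0, y0, ch, sh = actor_pose
    out = []
    for cx, cy, ct, st in waypoints:
        # rotate the local offset by the actor heading, then translate
        gx = x0 + ch * cx - sh * cy
        gy = y0 + sh * cx + ch * cy
        # compose headings: angle addition expressed on (cos, sin) pairs
        gct = ch * ct - sh * st
        gst = sh * ct + ch * st
        out.append((gx, gy, gct, gst))
    return np.array(out)
```

For example, for an actor at (1, 0) heading 90 degrees, a waypoint 1 m ahead in its own frame lands at (1, 1) in the global frame.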
3.2. Feature extraction and loss functions
As illustrated in Fig. 1a, the ﬁrst stage of the joint model
detects objects and extracts features. From the input BEV
raster, a 4× downsampled feature map is extracted by a
deep CNN that follows common design (see Appendix for
the complete network design). It consists of 3 operations: (1) a convolutional block (ConvB) comprising a convolution (kernel size 3×3), batch normalization, and optionally ReLU; (2) a ResNet v2 block (ResB) [19]; and (3) upsampling
using bi-linear interpolation. Features are processed at mul-
tiple scales to provide larger receptive ﬁelds for capturing
wider context and past motion of the actors.
Following the computation of the BEV feature map,
classiﬁcation and regression are performed on the 1D fea-
ture vector for each grid cell. Through a fully-connected
(FC) layer and a softmax function, we obtain the likeli-
hood pc of existence of a vehicle actor whose center is lo-
cated in the cell c. We use focal loss ℓf [26] to address
the foreground/background imbalance. Through a separate
FC layer, the network at the same time regresses the de-
tection bounding boxes X0. The centroid and heading are
relative to the cell center and the AV heading, respectively.
Then, the first-stage detection loss is given as follows (the hat notation \hat{*} indicates the ground-truth targets):
L_{det} = \sum_{c \in \mathrm{all}} \ell_f(\hat{p}_c, p_c) + \sum_{v \in \mathrm{veh}} \Big[ \ell_1(\hat{l}^v - l^v) + \ell_1(\hat{w}^v - w^v) + \ell_1(\hat{c}^v_{x0} - c^v_{x0}) + \ell_1(\hat{c}^v_{y0} - c^v_{y0}) + \ell_1(\cos\hat{\theta}^v_0 - \cos\theta^v_0) + \ell_1(\sin\hat{\theta}^v_0 - \sin\theta^v_0) \Big],   (1)
where all and veh represent all grid cells and vehicle foreground grid cells, respectively, \hat{p}_c equals 1 for foreground cells and 0 otherwise, and \ell_1 is the smooth-L1 loss (with the transition value set to 0.1).
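The per-term building blocks of Eq. 1 can be sketched as follows; the smooth-L1 transition value of 0.1 is from the text, while the focal-loss form and its γ are standard choices assumed here rather than stated in this section:

```python
import numpy as np

def smooth_l1(x, beta=0.1):
    # quadratic below the transition value (0.1, as in the text),
    # linear above it; x is a raw residual such as l_hat - l
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def focal_loss(p_hat, p, gamma=2.0):
    # binary focal loss on the per-cell existence probability p;
    # p_hat is 1 for foreground cells and 0 otherwise; gamma=2 is an
    # assumed standard value
    pt = np.where(p_hat == 1, p, 1.0 - p)
    return -((1.0 - pt) ** gamma) * np.log(pt + 1e-12)
```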
In addition to the detection loss, end-to-end models also
optimize for the prediction loss that is only applied to future
waypoints of the actors. Moreover, we model the multi-
modality of the predictions [10] by classifying three modes

for each actor (i.e., turning left, turning right, or going
straight), where a separate trajectory is regressed for each
mode along with the corresponding mode probability [7]
based on the focal loss. In addition, regression loss is ap-
plied only to the trajectory mode that is closest to the ob-
served trajectory. Then, the prediction loss is given as
L_{pred} = \sum_{v \in \mathrm{veh}} \sum_{m=1}^{M=3} \Big[ \ell_f(\hat{p}^v_m, p^v_m) + \mathbb{1}_{\hat{m}=m} \frac{1}{H} \sum_{t=T_1}^{T_H} \big( \ell_1(\hat{c}^v_{xt} - c^v_{xmt}) + \ell_1(\hat{c}^v_{yt} - c^v_{ymt}) + \ell_1(\cos\hat{\theta}^v_t - \cos\theta^v_{mt}) + \ell_1(\sin\hat{\theta}^v_t - \sin\theta^v_{mt}) \big) \Big],   (2)
where p^v_m denotes the probability of the m-th trajectory mode of actor v, \mathbb{1}_c is an indicator function equaling 1 if the condition c holds and 0 otherwise, and \hat{m} indicates the index of the mode closest to the ground truth. Future centroids and
headings are relative to the cell center and the AV heading,
respectively (see Fig. 1a), while they are in the actor frames
in the two-stage models (see Fig. 1b-c). Then, Ldet and
Lpred can be optimized together in a joint training.
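The mode-selection step of Eq. 2, where regression is applied only to the mode closest to the observed trajectory, can be sketched as below (centroid terms only; the heading terms are omitted and the function name and shapes are illustrative):

```python
import numpy as np

def pred_regression_loss(traj_modes, gt_traj, beta=0.1):
    """Regress only the mode closest to the observed trajectory (m_hat),
    as in Eq. 2. traj_modes: (M, H, 2) predicted centroids; gt_traj: (H, 2).
    Illustrative sketch; the full loss also covers cos/sin heading terms."""
    # average displacement of each mode to the ground truth
    dists = np.linalg.norm(traj_modes - gt_traj, axis=-1).mean(axis=-1)
    m_hat = int(np.argmin(dists))          # index of the closest mode
    res = np.abs(traj_modes[m_hat] - gt_traj)
    l1 = np.where(res < beta, 0.5 * res ** 2 / beta, res - 0.5 * beta)
    return m_hat, float(l1.sum())
```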
For single-stage models the detection and prediction val-
ues are both optimized in the ﬁrst stage (Fig. 1a). On the
other hand, when the ﬁrst stage serves as a part of the two-
stage architecture (Fig. 1b-c), Ldet is optimized as a part of
the ﬁrst-stage output while Lpred is optimized in the second
stage, discussed in the remainder of this section.
3.3. Interaction using convolutions implicitly
In the previous section we discussed the ﬁrst-stage fea-
ture extraction, that computes per-actor grid features which
are then used as an input to the second-stage models to predict future motion. In this section we discuss how to compute per-actor features that better capture interactions:
• To capture relationship to nearby actors for the actor
for whom the future trajectories are predicted (called
the actor of interest), an input of the forecasting mod-
ule can be a region covering the interacting actors and
objects on the feature map, instead of only using the
feature pixel. For the traffic use-case, this interaction region (IR) should cover the area containing the objects the actor must attend to. Our results show that
for vehicle actors, a large region ahead of the vehicle
provides good context to model interaction.
• To overcome rotational variance of convolutions, in-
stead of cropping the IR features in the coordinate
frame of the original BEV grid whose orientation is
determined by AV, we deﬁne IR in the frame of the
actor of interest (i.e., actor frame), in which the out-
put trajectories are also deﬁned (commonly referred to
as RROI [29]). Our results conﬁrm the importance of
rotational invariance in modeling interactions.
• To effectively propagate non-local information of the
interacting actors to the actor of interest, we can use an
interactive CNN (ICNN) consisting of a few downsam-
pling convolutional layers that eventually condense an
IR comprising the actor of interest itself, its surround-
ing actors, and the environment, into a feature vector
used as the ﬁnal feature for this actor.
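The three components above can be illustrated with a toy rotated-ROI crop: sampling a square IR defined in the actor frame from a BEV feature map via bilinear interpolation. This is a hypothetical sketch (one pixel = 1 m, single channel), not the authors' implementation:

```python
import numpy as np

def crop_interaction_region(fmap, pose, ir_size, out_px, front_ratio=5 / 6):
    """Bilinearly sample a square interaction region (IR) from a BEV
    feature map (H, W) in the actor frame. pose = (x, y, cos_h, sin_h);
    front_ratio = 5/6 places five sixths of the IR ahead of the actor
    (the paper's 5:1 front-to-back split)."""
    x0, y0, ch, sh = pose
    # sample locations in the actor frame: forward axis u, lateral axis v
    u = np.linspace(-(1 - front_ratio) * ir_size, front_ratio * ir_size, out_px)
    v = np.linspace(-ir_size / 2, ir_size / 2, out_px)
    uu, vv = np.meshgrid(u, v, indexing="ij")
    # rotate into the global grid and translate to the actor centroid
    gx = x0 + ch * uu - sh * vv
    gy = y0 + sh * uu + ch * vv
    # bilinear interpolation on the feature map
    ix = np.clip(gx, 0, fmap.shape[0] - 1)
    iy = np.clip(gy, 0, fmap.shape[1] - 1)
    x1, y1 = np.floor(ix).astype(int), np.floor(iy).astype(int)
    x2 = np.minimum(x1 + 1, fmap.shape[0] - 1)
    y2 = np.minimum(y1 + 1, fmap.shape[1] - 1)
    fx, fy = ix - x1, iy - y1
    return (fmap[x1, y1] * (1 - fx) * (1 - fy) + fmap[x2, y1] * fx * (1 - fy)
            + fmap[x1, y2] * (1 - fx) * fy + fmap[x2, y2] * fx * fy)
```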
As mentioned, the actor-centric feature map and the RROI techniques have been utilized in a number of applications [5, 10, 11], where they were found to lower displacement errors in trajectory forecasting tasks. In this paper we
demonstrate that, by combining these ideas, convolutions
are effective in modeling spatial interactions as well. More-
over, as shown by our experiments, by varying the param-
eters of these ingredients one can control the level of in-
teraction modeling, providing further evidence that spatial
interactions can be effectively captured by convolutions.
The implementation of these three components is illustrated in the dashed box in Fig. 1b, which we refer to as
the interaction convolutional module (ICM). For each actor
we deﬁne a square IR around it, which is then used to crop
actor-centric features from the global feature map using bi-
linear interpolation. We vary the size, orientation, and the
position of the actor in the IR to study their effects on the
performance of interaction modeling (e.g., in the extreme
case where the IR has no area, the cropped feature is just
the feature pixel on the feature map). We choose a square
IR to simplify the discussion. The length of the square side
is referred to as IR size in the following discussion. Similarly, for simplicity, the ICNN module always consists of six ConvBs and
one ResB to gradually reduce the cropped feature map to
a 1D feature vector fc (e.g., if the crop size is 32 × 32,
setting the strides of the last ﬁve ConvBs to 2 yields a 1D
vector; see Appendix for detailed discussion on crop sizes
and ICNN design). The ﬁnal multimodal classiﬁcation and
future trajectory regression in the actor frame are obtained
from this 1D vector via a single FC layer, one for each task.
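The shape bookkeeping of the ICNN aggregation can be sketched as follows, with a 2×2 average standing in for each stride-2 ConvB (the real module is six ConvBs plus one ResB with learned weights; this only illustrates the spatial reduction):

```python
import numpy as np

def icnn_reduce(crop):
    """Shape-only sketch of the ICNN aggregation: repeated stride-2
    reductions condense a square cropped feature map (e.g. 32x32)
    into a single value per channel."""
    x = crop
    while x.shape[0] > 1:
        h, w = x.shape[0] // 2, x.shape[1] // 2
        x = x.reshape(h, 2, w, 2).mean(axis=(1, 3))  # stride-2 downsample
    return x[0, 0]
```

With a 32×32 crop this performs five halvings (32, 16, 8, 4, 2, 1), mirroring how setting the strides of the last five ConvBs to 2 yields a 1D vector.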
3.4. Interaction using graphs explicitly
The purely convolutional approach described in the pre-
vious section provides implicit interaction modeling. To ex-
plicitly account for interactions, a common approach is the
use of GNNs, discussed in this section. As there exist many
variants, we choose one of the more general approaches,
the message passing neural network [15,39], which has also
been adapted to the motion forecasting problem [5].
Indicated by the dashed box in Fig. 1c, a fully connected
graph comprises all of the V actors (represented as nodes),
with bi-directional edges between every two actors. The

Figure 2. Schematic of interaction loss. The actor (blue) is ap-
proximated with 3 costing circles (green), with minimal distances
(black) to an obstacle (grey) and resulting gradients (red).
feature attribute n_i of the i-th node is initialized by

n^0_i = \mathrm{MLP}_{\mathrm{init}}(fc_i),   (3)

where fc_i is the final feature vector of the i-th actor computed in the previous section. All multi-layer perceptrons
(MLPs) in this GNN have two layers. The message passing
at the k-th iteration via edge from node j to i is given by
m^k_{j \to i} = \mathrm{MLP}^k_e([n^k_i, \mathrm{rel}_{j \to i}, n^k_j, \mathrm{rel}_{i \to j}]),   (4)
where [·] denotes concatenation. Unlike the implicit convolutional approach of the previous section, where the relative spatial relations of actors are intrinsically represented within the crop, spatial relationships must be provided explicitly in a graph representation. The relative geometric feature rel_{j \to i}, consisting of the coordinates and heading of actor j in the frame of actor i, is computed as

\mathrm{rel}_{j \to i} = \mathrm{MLP}_{\mathrm{rel}}([x_{j \to i}, y_{j \to i}, \cos\theta_{j \to i}, \sin\theta_{j \to i}]).   (5)
All of the messages sent to the i-th graph node are aggre-
gated by a max-pooling operation, denoted as
m^k_i = \mathrm{Pool}_j(m^k_{j \to i}).   (6)
Finally, the node attribute is updated with a Gated Recurrent Unit (GRU) [5, 15, 39] whose hidden state is n^k_i and whose input is m^k_i:

n^{k+1}_i = \mathrm{GRU}(n^k_i, m^k_i).   (7)
In general, the update iterates K times. Finally, multimodal classification and future trajectories for the actor are computed from the resulting node attribute n^K_i, as discussed in Section 3.3.
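Eqs. 3-7 amount to one round of message passing. The sketch below follows that structure with tiny numpy MLPs; as loudly-flagged simplifications, the GRU update of Eq. 7 is replaced by an additive tanh update, and the weights are arbitrary rather than learned:

```python
import numpy as np

def two_layer_mlp(params, x):
    # all MLPs in the graph have two layers; ReLU in between
    w1, w2 = params
    return w2 @ np.maximum(w1 @ x, 0.0)

def rel_feature(pose_j, pose_i, params):
    # Eq. 5: coordinates and heading of actor j in the frame of actor i
    xi, yi, ci, si = pose_i
    xj, yj, cj, sj = pose_j
    dx, dy = xj - xi, yj - yi
    x_ji = ci * dx + si * dy            # rotate the offset into frame i
    y_ji = -si * dx + ci * dy
    c_ji = ci * cj + si * sj            # heading difference as (cos, sin)
    s_ji = ci * sj - si * cj
    return two_layer_mlp(params, np.array([x_ji, y_ji, c_ji, s_ji]))

def gnn_step(nodes, poses, pe, pr):
    # Eqs. 4 and 6: per-edge messages, max-pooled per receiving node
    V = nodes.shape[0]
    new = np.empty_like(nodes)
    for i in range(V):
        msgs = []
        for j in range(V):
            if j == i:
                continue
            e = np.concatenate([nodes[i], rel_feature(poses[j], poses[i], pr),
                                nodes[j], rel_feature(poses[i], poses[j], pr)])
            msgs.append(two_layer_mlp(pe, e))
        # Eq. 7 stand-in: an additive tanh update replaces the GRU here
        new[i] = np.tanh(nodes[i] + np.max(msgs, axis=0))
    return new
```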
3.5. Interaction loss
In this section we introduce a novel interaction loss to improve the interaction awareness of the model, which directly penalizes predicted forecasts of an actor that overlap with static traffic objects (defined as objects with speed less than 0.2 m/s). Traffic objects comprise objects that a vehicle
should avoid, including vehicles, cyclists, pedestrians, con-
struction fences, etc. At each prediction horizon, the pre-
dicted actor is approximated with 3 inscribed costing cir-
cles, as illustrated in Fig. 2. The loss is then computed as
L_{col} = \frac{1}{3VH} \sum_{v}^{V} \sum_{n}^{N} \sum_{t=1}^{H} \sum_{l=1}^{L=3} \max(0, R_{vl} - d_{vntl}),   (8)
where V, N, H, and L are the numbers of actors, non-moving obstacles, prediction time horizons, and costing circles, respectively. R_{vl} is the radius of a costing circle (determined by the size of the ground-truth bounding box), while d_{vntl} is the signed minimum distance between the l-th costing circle center of the v-th actor and the n-th obstacle bounding box at time t. The distance is negative when the center is inside the obstacle's bounding box.
Note that the loss only considers overlaps between pre-
dicted trajectories and the ground-truth bounding boxes of
static obstacles. Moving actors may have multimodal tra-
jectory distributions, and it can be unclear when an overlap
between the trajectories of two moving actors should be penalized by the loss. Note that when the costing circles overlap with an obstacle bounding box, the interaction loss only back-propagates gradients through the predicted centroid and heading. The loss is added to the prediction
loss Lpred where it is applied to the ˆm-th predicted trajec-
tory, and optimized jointly in the end-to-end training.
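For one actor at one horizon, the hinge penalty of Eq. 8 can be sketched as follows; axis-aligned obstacle boxes are an assumption made here for brevity, whereas the paper uses oriented ground-truth boxes:

```python
import numpy as np

def interaction_loss(centers, radii, obstacles):
    """Hinge penalty of Eq. 8 for one actor at one horizon: each costing
    circle pays max(0, R - d) against each static obstacle. Obstacles are
    axis-aligned (xmin, ymin, xmax, ymax) boxes in this sketch."""
    loss = 0.0
    for (cx, cy), r in zip(centers, radii):
        for xmin, ymin, xmax, ymax in obstacles:
            # distance from circle center to the box boundary
            dx = max(xmin - cx, 0.0, cx - xmax)
            dy = max(ymin - cy, 0.0, cy - ymax)
            d = np.hypot(dx, dy)
            if dx == 0.0 and dy == 0.0:   # center inside: negative distance
                d = -min(cx - xmin, xmax - cx, cy - ymin, ymax - cy)
            loss += max(0.0, r - d)
    return loss
```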
4. EXPERIMENTS
Input and output. The considered area is of size 150 × 100 × 3.2 m, centered on the AV and discretized into a
960 × 640 × 16 grid into which the LiDAR sweep informa-
tion is encoded. The input contains 10 LiDAR sweeps col-
lected at 0.1s interval, as well as a semantic HD map from
the current timestamp. The models detect the vehicle actors
at the current time step and forecast their trajectories at fu-
ture time horizons t ∈{0.1, 0.2, . . . , 4.0s}. Non-maximum
suppression (NMS) [31] with Intersection over Union (IoU)
threshold set at 0.1 is applied in order to eliminate duplicate
detections.
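The duplicate-removal step can be sketched with standard greedy NMS; axis-aligned boxes are assumed here for simplicity, whereas the detector outputs oriented boxes:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.1):
    """Greedy NMS on axis-aligned (xmin, ymin, xmax, ymax) boxes,
    suppressing boxes whose IoU with a higher-scoring kept box
    exceeds the threshold (0.1 in the experiments)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # intersection of the kept box with the remaining candidates
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = ((boxes[order[1:], 2] - boxes[order[1:], 0])
                  * (boxes[order[1:], 3] - boxes[order[1:], 1]))
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]
    return keep
```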
Metrics. The studies are focused on prediction accuracy
and interaction performance. IoU threshold for object de-
tection matching is set at 0.5. We observe that the detection
performance changes little in all of the considered models
reported in the paper, with average precision at 94.0 ± 0.4.
Furthermore, we ensure equal numbers of trajectories are
considered in the metrics by adjusting the detection proba-
bility threshold at a ﬁxed recall of 0.8 [5]. Each actor has 3
predicted trajectory modes, and we assign the trajectory of
the most probable mode to the actor in the following met-
ric computation. We use displacement error (DE) at 4s to
measure the prediction accuracy, averaged over all actors.
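The DE metric can be computed as below (hypothetical helper; at a 0.1 s step, the 4 s horizon corresponds to the final trajectory index):

```python
import numpy as np

def displacement_error_at(pred, gt, horizon_idx):
    """DE at a fixed horizon (e.g. 4s = index 39 at 0.1s steps),
    averaged over actors. pred, gt: (V, H, 2) centroid arrays."""
    d = np.linalg.norm(pred[:, horizon_idx] - gt[:, horizon_idx], axis=-1)
    return d.mean()
```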
To quantify the interaction performance of the models
we consider two overlap metrics in our experiments (addi-
tional results of other metrics are provided in Appendix):

Figure 3. Effects of the ICM components: interaction region (IR), interaction CNN (ICNN), and actor frame (AF). Extractor is the single-
stage model; +IR+ICNN represents the two-stage model that deﬁnes IRs in the AV frame; +ICM (i.e., +IR+ICNN+AF) is the proposed
two-stage model that deﬁnes IRs in the actor frame. All IRs have a ﬁxed 5:1 front-to-back ratio, unless speciﬁed otherwise. Inset: models
with a ﬁxed IR size of 60m and varied front-to-back ratios.
• Actor-actor overlap rate is the percentage of predicted
trajectories of detected actors overlapping with pre-
dicted trajectories of other detected actors.
• Actor-static overlap rate is the percentage of pre-
dicted trajectories of detected actors overlapping with
ground-truth static trafﬁc objects.
An actor overlap is defined as an intersection-over-obstacle-polygon of more than 0.05 at any point of the 4s-long trajectory, with the threshold set to eliminate false-positive overlaps due to small noise in the labeled bounding boxes.
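The actor-static overlap check can be sketched as follows, again with axis-aligned boxes standing in for the oriented actor and obstacle polygons used in the paper:

```python
def actor_static_overlap(traj_boxes, obstacle, thresh=0.05):
    """Flags a predicted trajectory that overlaps a static obstacle:
    intersection-over-obstacle-polygon above 0.05 at any waypoint.
    Boxes are axis-aligned (xmin, ymin, xmax, ymax) in this sketch."""
    ox1, oy1, ox2, oy2 = obstacle
    o_area = (ox2 - ox1) * (oy2 - oy1)
    for x1, y1, x2, y2 in traj_boxes:
        iw = min(x2, ox2) - max(x1, ox1)
        ih = min(y2, oy2) - max(y1, oy1)
        if iw > 0 and ih > 0 and iw * ih / o_area > thresh:
            return True
    return False
```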
Data.
We conducted an evaluation on a large in-house
data set, containing 19,000 scenes of 25s each and collected
across several cities in North America with high-quality
10Hz annotations. To mitigate the metric variance of the
sparse overlaps, (1) a large split of 5,000 scenes is left out
for testing; (2) the test frames in scenes have a temporal
spacing of 2s to avoid counting the same overlaps multiple
times; and (3) the training and test sets are split geograph-
ically to prevent models from memorizing the same static
obstacles and environment. Using this larger data set, as
opposed to using popular open-sourced data sets that are
signiﬁcantly smaller, allows for lower metric variance and
deriving more general conclusions.
Lastly, by using the
same input, backbone network, loss functions, and train-
ing settings, our studies contrast the interaction modeling
approaches. This allows us to focus on the relative perfor-
mance of these approaches, as opposed to comparing inde-
pendent models where ensuring similar network capacities
and equally well-tuned hyper-parameters are typically chal-
lenging tasks. In Appendix we provide details on the data
sets, metric variance, and comparison to other motion fore-
casting models on public data sets.
4.1. Results
Interaction using convolutions. The performance of
the single-stage model (Fig. 1a) that contains only the fea-
ture extractor is shown in Fig. 3 (Extractor, black). The
+IR+ICNN (green) curve shows the performance of the
two-stage model without rotating the interaction region for
each actor into the actor frame. In particular, starting from
the 1D per-actor feature map vectors (0m), we increase the
IR size to 80m. By cropping larger feature map regions that
contain more interacting actors and surrounding context,
displacement error and forecasted overlap rates decrease.
We then rotate IRs to match estimated actor orientation
instead of using the common AV frame (+ICM, blue). For
zero IR size (i.e., a cropped feature is still the feature pixel),
we observe DE drops signiﬁcantly compared to the model
using the AV frame with zero IR size (green). This has been
explained previously as a beneﬁt of a standardized output
representation [10]. Although deﬁning the IR in the actor
frame reduces rotational variance, the zero-size IR covers
no interacting actors and we thus observe little change in
the actor overlap rates. Here, a lower DE does not necessarily correlate with better interaction modeling. As the IR is increased
in size, both DE and the interaction metrics improve dra-
matically. Crop sizes of beyond 60m show no further im-
provement, likely because the majority of interacting actors
and obstacles are already included within the 60m region.
In all of the IRs above we have ﬁxed the front-to-back
ratio to 5:1, meaning an IR of size 60m includes 50m ahead
and 10m behind the actor. In Fig. 3 inset we ﬁx the to-
tal size at 60m, and vary the front-to-back ratio (blue). As
the vast majority of actors are moving forward, we can see
that placing more of the IR ahead of the actor improves in-
teraction modeling. It is interesting to note the divergence
between DE and overlap rates again: after the front-to-
back ratio is above 1:1, the overlap rates continue to drop
marginally, while the DE improvement stops. Even for the
actor-centered IR (inset, green), not rotating the IR to match
the actor orientation yields worse DE and overlap rates,
which further conﬁrms the importance of removing rota-
tional variance for interaction modeling using convolutions.
From Fig. 3, we observe that by cropping an actor-frame
deﬁned region of the feature map and then applying convo-

Figure 4. Performance from adding a GNN on top of the ICM (Fig. 1c) for different ICM interaction region sizes. +GNN (only attributes)
encodes only node attributes in the graph edges; +GNN (only relative) encodes only relative locations and orientations in the graph edges.
Figure 5. Comparison of ICM and GNN (which includes ICM). +GNN (no edges) is identical to +GNN except the graph edges are cut off.
+IL represents models trained with the additional interaction loss. Note that comparing +ICM at large IR (interaction is modeled by ICM)
against +GNN at small IR (interaction is modeled by GNN) shows that a pure ICM can outperform a pure GNN in modeling interactions.
lutions improves forecasting and interaction modeling con-
siderably. Strong dependence of overlap rates on IR size
provides evidence that convolutions are effectively captur-
ing interactions once other actors are inside the IR.
Interaction using graphs. As illustrated in Fig. 1c, for
these experiments we add a GNN after the ICM. Note that,
as discussed earlier, setting the IR size to 0m deactivates
the ICM while retaining the beneﬁt of reduced rotational
variance. For zero IR size (Fig. 4, +GNN, red) we see
that the GNN indeed improves DE and overlap rates signif-
icantly as compared to the models without designated in-
teraction modeling capability in Fig. 3 (+ICM, 0m). No-
tably, even when GNN is utilized, we observe that ICM
can still provide additional performance improvements as
we gradually increase ICM’s interaction modeling by ex-
panding the IR size. We also examine the beneﬁt of the
hand-crafted relative geometries in the graph edges. When
the IR is small (i.e., ICM is limited), keeping only the node
attributes ni (blue) or relative geometries reli,j (green) sig-
niﬁcantly damages the graph modeling. For large IR sizes,
the difference between the three graph models becomes mi-
nor, suggesting that with larger feature crops the ICM has
effectively compensated for missing GNN features.
The GNNs in the models above are single-iteration. We
also evaluated the effect of increasing the GNN iterations to
K = 2. An additional iteration (i.e., K = 2) reduces DE
and overlap rates further by a small amount when the IR
size is small, which could be explained by the well-known
bottleneck phenomenon of GNNs [2] and the fact that the
graph is fully connected. This improvement is negligible
for all but the smallest IRs, and no further exploration of
additional iterations is provided below.
Convolutions vs. graphs for interaction. In Fig. 5 we
compare the implicit ICM (blue) and explicit GNN (red) ap-
proaches. With zero IR size (where ICM is effectively off)
the gain of adding GNN is signiﬁcant. However, as the IR
grows, we observe that the performance gap steadily nar-
rows. In other words, while turning on ICM (by increasing
IR size) can further improve the performance of GNN mod-
els, adding a graph to an ICM with a sufﬁciently large IR
provides only minor beneﬁts. To understand the gaps be-
tween +ICM and +GNN with large IR sizes, we study a
graph-less model (+GNN (no edges), black) created by removing graph edges in +GNN. For large IRs, the graph-less model matches the performance of +GNN, suggesting that the explicit interaction graph of the GNN contributes little to the performance. Thus, the gaps between +ICM and +GNN for larger IR sizes are mainly due to the extra network capacity of the GNN. Lastly, we see that comparing +ICM at large IR
of GNN. Lastly, we see that comparing +ICM at large IR
(i.e., interaction modeled by ICM) against +GNN at small
IR (i.e., interaction modeled by GNN) shows that a pure
ICM can outperform a pure GNN in modeling interactions.

Figure 6. Predicted trajectories sampled at 2Hz of baseline (top)
and ICM (bottom). red: overlapped obstacles; blue: forecasts of
the actors of interest; grey: forecasts of other actors; green: labels
(also see attached videos in the Appendix).
Interaction loss. We can also see that adding the inter-
action loss (Eq. 8) reduces the overlap between actors’ pre-
dicted trajectories for both interaction modeling approaches
(green and magenta in Fig. 5). The improvement is sig-
niﬁcant for smaller IRs, which may be due to the fact that
the smaller IRs do not provide enough information to model
the interactions effectively, beneﬁting more from this added
supervision. On the other hand, when interaction is mod-
eled more effectively through larger IR, the loss is sparser
and thus contributes less. Interestingly, the interaction loss
does not affect DE results except for ICM models at small
IR where the interaction modeling is limited.
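The interaction loss itself is defined by Eq. 8 in the main text. As a rough illustration of the underlying idea only, the sketch below penalizes the total pairwise overlap between predicted trajectories; it uses axis-aligned boxes for simplicity, whereas the paper operates on oriented bird's-eye-view boxes, so this is a hypothetical simplification rather than the actual loss:

```python
# Hypothetical, simplified sketch of an interaction (overlap) penalty between
# predicted trajectories. Boxes are axis-aligned (x_min, y_min, x_max, y_max);
# the paper's actual loss (Eq. 8) operates on oriented boxes inside the
# training graph, so treat this purely as an illustration of the idea.

def overlap_area(a, b):
    """Intersection area of two axis-aligned boxes (0.0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def interaction_penalty(trajectories):
    """Sum of pairwise box overlaps across all timestamps.

    `trajectories` maps actor id -> list of per-timestamp boxes.
    A non-zero value penalizes forecasts that place two actors
    in the same place at the same time.
    """
    ids = sorted(trajectories)
    total = 0.0
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            for box_a, box_b in zip(trajectories[a], trajectories[b]):
                total += overlap_area(box_a, box_b)
    return total
```

Because the penalty is zero for non-overlapping forecasts, it acts as the sparse, targeted supervision described above: it only contributes gradient where predicted trajectories actually collide.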
Maneuver-speciﬁc qualitative results. In Fig. 6 we
present a comparison of the baseline ICM model with a 0m
IR (which has no dedicated interaction modeling) and the
ICM model with a 60m IR on three typical maneuvers
observed in interacting scenarios: adaptive cruise control
(ACC), turn, and nudging. We note that the 0m model in all
cases incorrectly predicts overlapping trajectories. In the
ACC case the ICM model correctly predicts that the vehicle
would decelerate and queue after others, while in the turn
case it outputs a trajectory that follows the lane and avoids
overlapping with the vehicles after the turn. In the nudging
case, where the vehicle motion starts with considerable curvature,
the forecast correctly reduces the curvature and straightens
the trajectory to avoid the parked cars. We also examined
the results of adding GNNs on +ICM (60m) on these ma-
neuvers, and observed no signiﬁcant difference.
Inference time. The baseline model, which includes the
feature extractor and other components such as input
pre-processing and output post-processing, takes 45.6ms per
frame. Next we measure the additional time costs of adding
the ICM and GNN modules to the baseline model, shown in
Table 1. The ICM of zero IR size adds an additional 5.2ms,
which includes processing of the feature pixel and computa-
tion of the ﬁnal output. ICM with non-zero size uses convo-
lutions and bilinear feature cropping, operations that have
been optimized in current GPU software and hardware. As
a result, even the largest 80m ICM is only a few milliseconds
slower than the 0m ICM.

Table 1. Inference times of modules (tested on Nvidia Titan RTX)

  Module    IR size [m]    Inference [ms]
  ICM       0              5.2
  ICM       80             8.1
  GNN       -              46.9

Lastly, the GNN itself takes 46.9ms, multiple times slower
than the slowest ICM. This is
consistent with earlier results showing GNN inference may
be inefﬁcient resulting in higher latency [23]. Coupled with
the earlier results showing that modeling interaction using
convolutions can give competitive performance compared
to GNNs, we see that the convolutional approach represents
an efﬁcient and practical alternative to GNNs.
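The bilinear feature cropping used by the ICM maps an actor-centered, rotated IR onto the feature map grid; its core primitive is bilinear interpolation of the feature map at fractional coordinates. A minimal pure-Python sketch of that primitive follows (production systems would use an optimized GPU kernel such as PyTorch's `grid_sample`, which is what makes the crop nearly free in practice):

```python
def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at
    fractional coordinates (x, y). A minimal sketch of the sampling
    primitive behind rotated feature crops; coordinates are clamped
    at the border rather than padded.
    """
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    y1 = min(y0 + 1, len(fmap) - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bot = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bot * dy
```

A full crop simply evaluates this sampler on a regular grid of points rotated into the actor's frame, which is why the cost grows only mildly with IR size.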
5. LIMITATIONS AND SOCIAL IMPACT
While graphs represent a general approach to modeling
various relations, this work shows that convolutions are
also effective for modeling spatial interaction. However,
in the Euclidean space where CNNs are applied, some
information, such as vocal interactions between drivers, is
not currently represented. Moreover, as our studies are
limited to the 2D AV application, it remains open whether
convolutions are as effective in capturing interactions in
3D space (such as for modeling human body motion).
Additionally, the overlap rates over trajectories enable us
to explicitly evaluate the quality of the interaction modeling;
however, the sparsity of this metric requires larger test sets
to ensure the obtained results are meaningful, which limits
its wider use. Lastly, neither of the considered approaches
guarantees zero overlaps between the trajectories, and CNN
models often have lower interpretability than GNN approaches.
6. CONCLUSION
We compared and contrasted 2D convolutional and
graph neural networks for the task of spatial interaction
modeling, providing empirical evidence that under certain
conditions convolutional networks can reach performance
comparable to state-of-the-art GNNs (e.g., by modifying
the IR), thus allowing similar motion forecasting accuracy
and interaction modeling at lower latency
and model complexity. We analyzed common
components of the interaction approaches, leading to a bet-
ter understanding of how each beneﬁts the interaction mod-
eling. Moreover, we introduced a novel interaction-aware
loss and showed its impact on the considered approaches.
Our work presents a basis for wider use of convolutional
layers for the task of spatial interaction, providing evidence
that the gap between convolutional models and more com-
plex and computationally expensive GNN models may not
be as large as previously suspected.

References
[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan,
Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social
LSTM: Human trajectory prediction in crowded spaces. In
Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 961–971, 2016. 2
[2] Uri Alon and Eran Yahav. On the bottleneck of graph neu-
ral networks and its practical implications. arXiv preprint
arXiv:2006.05205, 2021. 7
[3] Peter W Battaglia, Razvan Pascanu, Matthew Lai, Danilo
Rezende, and Koray Kavukcuoglu. Interaction networks for
learning about objects, relations and physics. arXiv preprint
arXiv:1612.00222, 2016. 1
[4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora,
Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan,
Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal
dataset for autonomous driving. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pages 11621–11631, 2020. 13
[5] Sergio Casas, Cole Gulino, Renjie Liao, and Raquel Ur-
tasun.
Spatially-aware graph neural networks for rela-
tional behavior forecasting from sensor data. arXiv preprint
arXiv:1910.08233, 2019. 1, 2, 4, 5, 13
[6] Sergio Casas, Wenjie Luo, and Raquel Urtasun. IntentNet:
Learning to predict intention from raw sensor data. In
Conference on Robot Learning, pages 947–956, 2018. 2
[7] Henggang Cui, Vladan Radosavljevic, Fang-Chieh Chou,
Tsung-Han Lin, Thi Nguyen, Tzu-Kuo Huang, Jeff Schnei-
der, and Nemanja Djuric. Multimodal trajectory predictions
for autonomous driving using deep convolutional networks.
In 2019 International Conference on Robotics and Automa-
tion (ICRA), pages 2090–2096. IEEE, 2019. 4
[8] Nachiket Deo and Mohan M. Trivedi.
Convolutional
social pooling for vehicle trajectory prediction.
CoRR,
abs/1805.06771, 2018. 2
[9] Frederik Diehl, Thomas Brunner, Michael Truong-Le, and
Alois C. Knoll. Graph neural networks for modelling trafﬁc
participant interaction. CoRR, abs/1903.01254, 2019. 2
[10] Nemanja Djuric, Henggang Cui, Zhaoen Su, Shangxuan Wu,
Huahua Wang, Fang-Chieh Chou, Luisa San Martin, Song
Feng, Rui Hu, Yang Xu, et al. Multinet: Multiclass multi-
stage multimodal motion prediction. In Proceedings of the
IEEE Intelligent Vehicles Symposium (IV), 2020. 2, 3, 4, 6,
13
[11] Nemanja Djuric, Vladan Radosavljevic, Henggang Cui, Thi
Nguyen, Fang-Chieh Chou, Tsung-Han Lin, and Jeff Schnei-
der. Short-term motion prediction of trafﬁc actors for au-
tonomous driving using deep convolutional networks. arXiv
preprint arXiv:1808.05819, 2018. 2, 4
[12] Francis Engelmann, Theodora Kontogianni, and Bastian
Leibe. Dilated point convolutions: On the receptive ﬁeld size
of point convolutions on 3d point clouds. In 2020 IEEE In-
ternational Conference on Robotics and Automation (ICRA),
pages 9463–9469. IEEE, 2020. 1
[13] Alex Fout, Jonathon Byrd, Basir Shariat, and Asa Ben-Hur.
Protein interface prediction using graph convolutional net-
works. In Advances in neural information processing sys-
tems, pages 6530–6539, 2017. 1
[14] Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir
Anguelov, Congcong Li, and Cordelia Schmid. VectorNet:
Encoding HD maps and agent dynamics from vectorized
representation. arXiv preprint arXiv:2005.04259, 2020. 2
[15] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol
Vinyals, and George E Dahl. Neural message passing for
quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
4, 5
[16] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese,
and Alexandre Alahi.
Social GAN: socially acceptable
trajectories with generative adversarial networks.
CoRR,
abs/1803.10892, 2018. 2
[17] Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and
Yuji Matsumoto. Knowledge transfer for out-of-knowledge-
base entities: A graph neural network approach.
arXiv
preprint arXiv:1706.05674, 2017. 1
[18] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive
representation learning on large graphs. In Advances in neu-
ral information processing systems, pages 1024–1034, 2017.
1
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Identity mappings in deep residual networks. In European
conference on computer vision, pages 630–645. Springer,
2016. 3
[20] Boris Ivanovic and Marco Pavone. Modeling multimodal dy-
namic spatiotemporal graphs. CoRR, abs/1810.05993, 2018.
2
[21] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and
Le Song. Learning combinatorial optimization algorithms
over graphs. In Advances in neural information processing
systems, pages 6348–6358, 2017. 1
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization.
arXiv preprint arXiv:1412.6980,
2014. 12
[23] Kevin Kiningham, Christopher Re, and Philip Levis. GRIP: A
graph neural network accelerator architecture. arXiv preprint
arXiv:2007.13828, 2020. 8
[24] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max
Welling, and Richard Zemel. Neural relational inference for
interacting systems. In International Conference on Machine
Learning, pages 2688–2697. PMLR, 2018. 1, 2
[25] Thomas N Kipf and Max Welling. Semi-supervised classi-
ﬁcation with graph convolutional networks. arXiv preprint
arXiv:1609.02907, 2016. 1
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal
loss for dense object detection. In ICCV, 2017. 3
[27] Wenjie Luo, Bin Yang, and Raquel Urtasun. Fast and furi-
ous: Real time end-to-end 3d detection, tracking and motion
forecasting with a single convolutional net. In Proc. of the
IEEE CVPR, pages 3569–3577, 2018. 2
[28] Sindy Löwe, David Madras, Richard Zemel, and Max
Welling. Amortized causal discovery: Learning to infer
causal graphs from time-series data. arXiv preprint
arXiv:2006.10833, 2020. 1

[29] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong
Wang, Yingbin Zheng, and Xiangyang Xue.
Arbitrary-
oriented scene text detection via rotation proposals. CoRR,
abs/1703.01086, 2017. 2, 4
[30] Eli A. Meirom, Haggai Maron, Shie Mannor, and Gal
Chechik. How to stop epidemics: Controlling graph dynam-
ics with reinforcement learning and graph neural networks.
arXiv preprint arXiv:2010.05313, 2020. 1
[31] Alexander Neubeck and Luc Van Gool.
Efﬁcient non-
maximum suppression. In 18th International Conference on
Pattern Recognition (ICPR’06), volume 3, pages 850–855.
IEEE, 2006. 5
[32] Adam Paszke, Sam Gross, et al. PyTorch: An imperative
style, high-performance deep learning library. In H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and
R. Garnett, editors, Advances in Neural Information Processing
Systems 32, pages 8024–8035. Curran Associates, Inc.,
2019. 12
[33] Shah Rukh Qasim, Jan Kieseler, Yutaro Iiyama, and Maur-
izio Pierini. Learning representations of irregular particle-
detector geometry with distance-weighted graph networks.
The European Physical Journal C, 79(7), Jul 2019. 1
[34] Nicholas Rhinehart, Rowan McAllister, Kris M. Kitani, and
Sergey Levine. PRECOG: prediction conditioned on goals
in visual multi-agent settings. CoRR, abs/1905.01296, 2019.
1, 2
[35] Amir Sadeghian, Ferdinand Legros, Maxime Voisin, Ricky
Vesel, Alexandre Alahi, and Silvio Savarese. Car-net: Clair-
voyant attentive recurrent network. CoRR, abs/1711.10061,
2017. 1, 2, 13
[36] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Sprin-
genberg, Josh Merel, Martin Riedmiller, Raia Hadsell, and
Peter Battaglia. Graph networks as learnable physics engines
for inference and control. arXiv preprint arXiv:1806.01242,
2018. 1
[37] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne
Van Den Berg, Ivan Titov, and Max Welling. Modeling rela-
tional data with graph convolutional networks. In European
semantic web conference, pages 593–607. Springer, 2018. 1
[38] Zhaoen Su, Chao Wang, Henggang Cui, Nemanja Djuric,
Carlos Vallespi-Gonzalez, and David Bradley. Temporally-continuous
probabilistic prediction using polynomial trajectory
parameterization. In IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), 2021. 3, 13
[39] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaim-
ing He. Non-local neural networks. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 7794–7803, 2018. 4, 5
[40] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar,
and Kannan Achan.
Inductive representation learning on
temporal graphs.
arXiv preprint arXiv:2002.07962, 2020.
1

Convolutions for Spatial Interaction Modeling
- Appendix -
In the appendix, we provide complete details on the model and experiment implementation to facilitate reproducing the
proposed approaches and models: the network design of the ﬁrst stage backbone (Section A), the ICNN network designs
(Section B), the GNN network design (Section C), the training setup (Section D), the data set choices (Section E), and the metric
design and metric variances (Section F).
We also provide additional quantitative and qualitative results: the experimental results focused on forecasted actor trajec-
tories intersecting with non-vehicle trafﬁc objects (Section G) and videos (Section H).
A. The extractor backbone network
Figure S1. Multi-scale network design of the feature extractor. As illustrated in Fig. 1, the inputs are voxelized LiDAR point clouds and
rasterized map, while the network output is a feature map of size (256, W/4, L/4), where W and L are the grid width and length of
the input BEV representation, respectively. The green boxes labeled as C, /S represent tensors where C and S represent the number of
channels and down-sampled scale relative to the input size, respectively. The operations connecting two tensors are ConvBs, except for the
speciﬁed up-sampling, element-wise addition, and the cross-scale block. The cross-scale block, detailed in Fig. S2, is repeated 3 times.
In Fig. S1 we provide the full, detailed design of the CNN feature extractor used in all of the models studied (see the high-level
overview in Fig. 1). We note that the multi-scale design (as indicated by /1, /2, /4, /8, and /16, where the numbers
represent the down-sampling scales relative to the input size) and the cross-scale blocks (see Fig. S2) already encourage a
large receptive field. Nevertheless, the experimental results presented in the main paper show that such a single-stage CNN
architecture still models spatial interaction less effectively. Adding either the shallow ICNN or the GNN module in the
second stage significantly improves the interaction modeling performance.

Figure S2. The cross-scale block
B. The ICNN network
For the IRs equal to 80m, 60m, 40m, 20m, and 5m, we set the grid sizes of the feature map crops to 64, 48, 32, 16 and 4,
respectively. Zero-valued padding is utilized in the convolutional layers when necessary.
We did not extensively investigate network designs for the ICNN module. Several straightforward options (see Fig. S3)
that stacked ConvB and ResB blocks in series were evaluated empirically. These options set the strides of the last few ConvB
blocks to 2 so that the input feature map crop was down-sampled gradually to 1 × 1 after being processed by the ICNN. We
observed that the model performance was not sensitive to the changes in these ICNN variants.
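As a sanity check on these designs, the spatial sizes through such a stack can be enumerated. The helper below is a hypothetical illustration (not from the paper) of the size schedule: a few stride-1 blocks keep the crop size, after which stride-2 blocks halve it down to 1 × 1, matching, e.g., the 32 → 32 → 32 → 16 → 8 → 4 → 2 → 1 progression of design (b) in Fig. S3:

```python
def icnn_sizes(crop_size, n_stride1_blocks):
    """Spatial sizes through a hypothetical ICNN stack: the input crop,
    `n_stride1_blocks` stride-1 blocks that keep the size, then stride-2
    blocks that halve it until the map reaches 1x1."""
    sizes = [crop_size] * (n_stride1_blocks + 1)
    while sizes[-1] > 1:
        sizes.append(sizes[-1] // 2)
    return sizes
```

This also shows why the grid sizes are chosen as powers-of-two-friendly values: each stride-2 block halves the map exactly, so the crop collapses cleanly to a single feature vector.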
C. The GNN network
Two-layer MLPs are used in several places in the GNN networks of this work. All output vectors of the MLPs have
the same dimension as fv (256); in other words, the dimensions remain unchanged through the MLPs.
Note that both max-pooling and mean-pooling were studied in the GNNs (see Eq. 6) and no experimental difference
was observed. We also explored adding other relative relations, such as relative velocities, to the graph edge attributes and
observed insignificant changes in model performance. In addition, in the main text all node and edge attributes were based
on deterministic model outputs; we also studied using probabilistic (Gaussian and Laplace) outputs for the attributes, and
measured performance differences only within the metric variance level. Note that the GNNs evaluated in this work built edges
between all vehicle actors. Some of the edges were clearly unnecessary (e.g., between two far-apart vehicles driving away
from each other). However, given the high vehicle speeds and the long prediction horizon of 4s, such pruning was
not straightforward and was not explored in more depth.
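For concreteness, one message-passing iteration over a fully connected actor graph with max-pooling aggregation (cf. Eq. 6) can be sketched as below. Feature vectors are plain Python lists, and `edge_fn` / `node_fn` are hypothetical stand-ins for the paper's two-layer MLPs:

```python
# Minimal sketch of one message-passing iteration over a fully connected
# actor graph with element-wise max-pooling aggregation (cf. Eq. 6).
# `edge_fn` builds a message from the receiver state, sender state, and
# relative-geometry edge attribute; `node_fn` updates the node from the
# aggregated message. Both stand in for the two-layer MLPs.

def message_pass(nodes, rel, edge_fn, node_fn):
    """nodes: {i: feature vector}; rel: {(i, j): relative-geometry vector}."""
    updated = {}
    for i, n_i in nodes.items():
        # One message per neighbor, then aggregate by element-wise max.
        msgs = [edge_fn(n_i, nodes[j], rel[(i, j)]) for j in nodes if j != i]
        agg = [max(vals) for vals in zip(*msgs)]
        updated[i] = node_fn(n_i, agg)
    return updated
```

With K = 2 iterations (as in the ablation above), the function is simply applied twice, recomputing messages from the updated node states.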
D. The training setup
Each sequential training example comprises 10 past and current sweeps (−0.9s, −0.8s, ..., 0s), and 41 current and future
timestamps for ground-truth supervision (0s, 0.1s, ..., 4.0s). The frame at the current timestamp is referred to as the key frame.
Each scene in the in-house data set is 25s long, producing at most 200 complete sequential examples. We trained all of the
models with decimated key frames, passing over the training split once (i.e., every sequential example whose key frame is at t, t + 0.2s,
t + 0.4s, ..., is used once during model training).
The models were implemented in PyTorch [32] and trained end-to-end with 16 GPUs (Nvidia RTX 2080), with a batch
size of 2 per-GPU. Training without the GNN module is completed in about 12 hours. We use the Adam optimizer [22] with
a learning rate of 2e-4, decayed to 2e-5 at 75% and 2e-6 at 95% of the training iterations.
For the models with the interaction loss, the weight of the interaction loss was set to 2.0, which slightly outperformed
those with weights of 1.0 and 3.0 in the interaction metrics. No signiﬁcant displacement error difference was observed.
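The learning-rate schedule above can be written as a small step function. This is a sketch assuming the decay is keyed off the fraction of completed training iterations:

```python
def learning_rate(step, total_steps):
    """Stepwise schedule from the training setup: a base rate of 2e-4,
    decayed to 2e-5 at 75% and to 2e-6 at 95% of the training iterations."""
    frac = step / total_steps
    if frac >= 0.95:
        return 2e-6
    if frac >= 0.75:
        return 2e-5
    return 2e-4
```

In PyTorch this corresponds to pairing the Adam optimizer with a milestone-based scheduler (e.g., `torch.optim.lr_scheduler.MultiStepLR` with gamma=0.1).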

Figure S3. Various considered designs for ICNN, using 32×32 input as an example. The green boxes (C, W × W) represent tensors of
grid size equal to W × W and C channels (C is equal to 256 in all designs). The presented main results were based on design (a).
E. The data
In the Methodology Section, we briefly explained that a large data set was required to conduct experiments with low metric
variances and to derive general conclusions. Here we provide more details of the in-house data set and its comparison to some
open-sourced data sets.
As shown by the experimental results, the overlap rates are low, particularly for models with strong interaction modeling.
This means large and diverse test data is required to achieve low metric variances. For example, the training, validation, and
test sets of the popular nuScenes data set [4] for the prediction task combined contain 1000 scenes (each 20s long). On the
in-house data, our work uses a test set of 5000 scenes (each 25s long), which helps facilitate the comparison and analysis of
fine differences (e.g., those between +GNN and +GNN (no edges) at large IRs; see metric variances in the next section).
In pure trajectory prediction tasks, where the metric variance of displacement error is low even on smaller data sets,
we observed that the methodological comparisons drawn on this in-house data set generalize to other AV data sets.
Specifically, in our earlier trajectory prediction works [10, 38] we found the results on this in-house data set to correlate
with those on smaller open-sourced data sets. Comparing trajectory prediction performance on the nuScenes data, the 3s
displacement errors (vehicles) were 1.58, 1.45, and 1.04m for CAR-Net [5, 35], SpaGNN [5], and our conv-only model
(+ICM (IR=60m)), respectively. For a fair comparison, all models were measured at the same detection recall of 60%.
Finally, we confirm that the observation that the convolutional approach performs comparably to or better than the GNN
remains valid with smaller data subsets. E.g., trained with 1/3 of the full training set (with training time also 1/3 as long) and
evaluated on the same large test set (5000 scenes), we measured displacement errors of 0.818, 0.641, and 0.714m for the
baseline, +ICM (IR=80m), and +GNN (IR=0m) models, respectively. Their actor-actor overlap rates were 3.26, 1.04, and
1.70%, and their actor-static overlap rates were 1.22, 0.49, and 0.66%, respectively. These results suggest that the empirical
studies presented in this work are generally valid.
We complete the discussion of the data with two final notes. The first concerns the labeling standard for object detection:
actors far away from traffic regions (e.g., roads and sidewalks) were excluded during training and performance evaluation,
which could lead to higher detection performance than on data sets that consider all actors in the scene. The second concerns
the distribution of the multimodal predictions: approximately 80% of the moving vehicles were in the “straight”-driving
mode (as opposed to left and right turns).

Figure S4. Number of ground-truth label overlaps of all timestamps vs. IoP through the test set. Here we only consider label trajectories
whose boxes at the key-frame match the detections of the baseline model. We threshold the matching in terms of IoU at 0.1, 0.5, and 0.75
on the plot. The number of overlaps is non-zero even at high IoP, because large overlaps indeed occur in this data set, for instance, when
the arm of a construction vehicle is over another vehicle, which would be a large overlap in the bird’s eye view.
F. The metrics
The overlap interaction metrics are defined in terms of the ratio of the intersection between two objects to the area of the
smaller object (IoP), rather than the more common IoU, because the latter is insensitive to an overlap between a large object
and a small object. Although there are no collisions between actors in the data set, we measured non-zero overlap rates
between actor labels because some labeling boxes are slightly and imprecisely over-sized (and thus the learned object
detections are over-sized too). In Fig. S4 we plot the number of overlaps between labeled actors as a function of the IoP
threshold. The number drops rapidly above an IoP of approximately 0.05, meaning that thresholding IoP at 0.05 eliminates
the majority of the false positive overlaps. We adopted this threshold value in evaluating the overlaps of predicted trajectories.
This setting allowed the metrics to measure interaction modeling in the experiments robustly.
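A minimal sketch of the IoP computation and the 0.05 thresholding follows. It uses axis-aligned boxes for brevity; the metric itself is computed on oriented bird's-eye-view boxes, which requires a polygon intersection instead of this axis-aligned shortcut:

```python
def iop(box_a, box_b):
    """Intersection over the smaller box's area (IoP) for axis-aligned
    boxes (x_min, y_min, x_max, y_max). Unlike IoU, a small box fully
    inside a large one yields IoP = 1.0 regardless of the size ratio."""
    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    return inter / min(area(box_a), area(box_b))

def is_overlap(box_a, box_b, threshold=0.05):
    """Count a pair as overlapping only above the 0.05 IoP threshold,
    which filters most false positives from over-sized label boxes."""
    return iop(box_a, box_b) > threshold
```

The first example below shows the motivation for IoP: a 1m² intersection between a 100m² box and a 4m² box gives an IoP of 0.25 but an IoU of only about 0.01, so IoU would miss the overlap.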
The proposed convolutional approach can also be applied to improve interaction modeling for other traffic participants,
such as pedestrians and cyclists, and we observed significant effects in case studies. These studies are not discussed in this
work because even the labels of pedestrians and cyclists often had considerable, and arguably correct, overlaps. As a result,
the overlap rates were no longer unambiguous metrics for interaction modeling, and we thus limited our discussion to
interactions between vehicles, and between vehicles and static obstacles.
With these careful designs of the data and metrics, the resulting metric variances were 0.004m, 2e-4, and 3e-4
for DE, actor-actor, and actor-static overlap rates, respectively, computed by training +ICM (80m) 4 times (other models
were trained and evaluated once). This level of variance allowed us to resolve differences in interaction modeling between
models, even for rare events such as overlaps.
G. Additional results focusing on overlaps with non-vehicle actors
The actor-static overlap rate in the main paper considers overlaps of forecasted trajectories with both vehicles and
static non-vehicle traffic objects. In this section, we provide additional results focusing on overlaps with static non-vehicle
traffic objects. Here the overlap rate is defined as the percentage of forecasted trajectories of detected actors that overlap
with ground-truth static non-vehicle traffic objects.

Figure S5. Overlap rate of forecasted actor trajectories overlapping with static non-vehicle traffic objects.

The three panels in Fig. S5 correspond to Figs. 3 - 5 in the main paper. As
the feature map input cropped by the IR covers features of both vehicle and non-vehicle trafﬁc objects in the ICM approach,
it is not surprising that ICM effectively improves this interaction metric too. It is, however, interesting to note that even
though the GNN does not build nodes for the non-vehicle traffic objects in the graph, it also lowers this overlap rate by 24%
(comparing +ICM (0m) to +GNN (0m)). The reduction is attributed to the fact that by avoiding overlaps with vehicles
(after adding the GNN), overlaps with some of the non-vehicle objects near those vehicles are also avoided. Another
factor may be the proximity effect of CNNs, as the pixel features of a vehicle actor may encode information about its
nearby non-vehicle objects. The improvement of the GNN in avoiding overlaps with non-vehicle objects (24%) is, however,
considerably lower than that with vehicle actors (42%, as shown in the main paper by comparing +ICM (0m) to +GNN (0m)
in Fig. 5, right), which is reasonable as the GNN does not model the interactions with non-vehicle objects directly.
H. Videos
In addition to Fig. 6, we provide qualitative results with three videos1 where the predictions of the baseline (+ICM, 0m)
are on the left and the predictions of the ICM model (+ICM, 60m) are on the right. The overlapped obstacles are ﬁlled in red,
the ground-truth trajectories are in grey, and the forecasted trajectories are in blue. Trajectory visualization is downsampled
to 2Hz for clarity. Different from Fig. 6, we visualize the predictions of all actors in the common AV frame of reference. Note
that the videos are 20s long, because each scene in the data set is 25s long, where the ﬁrst second is used for the 10-sweep
input and the last four seconds are used for the 4s forecasting time horizon.
1 https://youtube.com/playlist?list=PLbI8u9Kk9gFyWIP7T9aWWvoO6nrEUAs1W
